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ABSTRACT 

Ground-based optical surveys such as PanSTARRS, DES, and LSST, will produce large catalogs to 
limiting magnitudes of r > 24. Star-galaxy separation poses a major challenge to such surveys because 
galaxies — even very compact galaxies — outnumber halo stars at these depths. We investigate photo- 
metric classification techniques on stars and galaxies with intrinsic FWHM < 0.2 arcsec. We consider 
unsupervised spectral energy distribution template fitting and supervised, data-driven Support Vec- 
tor Machines (SVM). For template fitting, we use a Maximum Likelihood (ML) method and a new 
Hierarchical Bayesian (HB) method, which learns the prior distribution of template probabilities from 
the data. SVM requires training data to classify unknown sources; ML and HB don't. We consider 
i.) a best-case scenario (SVM5 est ) where the training data is (unrealistically) a random sampling of 
the data in both signal-to- noise and demographics, and ii.) a more realistic scenario where training is 
done on higher signal-to- noise data (SVM rea i) at brighter apparent magnitudes. Testing with COS- 
MOS ugriz data we find that HB outperforms ML, delivering ~ 80% completeness, with purity of 
~ 60 — 90% for both stars and galaxies. We find no algorithm delivers perfect performance, and that 
studies of metal-poor main-sequence turnoff stars may be challenged by poor star-galaxy separation. 
Using the Receiver Operating Characteristic curve, we find a best-to- worst ranking of SVM5 est , HB, 
ML, and SVM rea ^. We conclude, therefore, that a well trained SVM will outperform template-fitting 
methods. However, a normally trained SVM performs worse. Thus, Hierarchical Bayesian template 
fitting may prove to be the optimal classification method in future surveys. 



1. INTRODUCTION 



-r < 1.0 



r > 1.0 



Until now, the primary way that stars and galaxies 
have been classified in larg e sky surveys has been a mor- 
phological separation (e.g., Kron|1980[ [ Yee|1991 Vascon- 



[cellos et "ah 2011 Henrion et al.|2ffll| ) ol point sources 



(presumably stars) from resolved sources (presumably 
galaxies). At bright apparent magnitudes, relatively few 
galaxies will contaminate a point source catalog and rel- 
atively few stars will contaminate a resolved source cata- 
log, making morphology a sufficient metric for classifica- 
tion. However, resolved stellar science in the current and 
next generation of wide- field, ground-based surveys is be- 
ing challenged by the vast number of unresolved galaxies 
at faint apparent magnitudes. 

To demonstrate this challenge for studies of field stars 
in the Milky Way (MW) , we compare the number of stars 
to the number of unresolved galaxies at faint apparent 
magnitudes. Figure [T] shows the fraction of COSMOS 
sources that are classified as stars as a function of r mag- 
nitude and angular s ize. The COSMOS catalog ((l,b) 



(237,43) degrees, I Ca pak et al.||2007a| |Scoville et al 
2009| ) relies on 30- band photometry 
plus HST/ACS morphology for source classification (see 
Section 4 for details) . In Figure [I] we plot separately 
relatively bluer (g — r < 1.0) and redder (g — r > 1.0) 
sources because bluer stars are representative of the old, 
metal-poor main sequence turnoff (MSTO) stars gener- 
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Fig. 1. — The stellar fraction of COSMOS sources as a function 
of magnitude, for sources with g — r < 1 (left) and g — r > 1 
(right). Only stars and galaxies were included in this figure; Only 
a few percent of the COSMOS point sources are AGN. Colored 
curves indicate the upper limit in intrinsic full-width half-maximum 
(FWHM) allowed in the sample. Even in an optimistic scenario 
where galaxies with FWHM > 0.2 arcsec can be morphologically 
distinguished from stars, unresolved galaxies will far outnumber 
stars in point source catalogs at faint magnitudes. This challenge 
is much greater for blue stars than for red stars. 

ally used to trace the MW's halo while redder stars are 
representative of the intrinsically fainter red dwarf stars 
generally used to trace the MW's disk. We will see that 
the effect of unresolved galaxies on these two populations 
is different, both because of galaxy demographics and be- 
cause the number density of halo MSTO stars decreases 
at faint magnitudes while the number density of disk red 
dwarf stars increases at faint magnitudes. 

In an optimistic scenario in which galaxies with 
FWHM > 0.2 arcsec can be morphologically resolved 
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(the blue line in Figure IT] second from the top), un- 
resolved galaxies will stillgreatly outnumber field MW 
stars in a point source catalog. For studies of blue stars, 
field star counts are dominated by unresolved galaxies 
by r ~ 23.5 and are devastated by unresolved galaxies 
at fainter magnitudes. The problem is far less severe for 
studies of red stars, which may dominate point source 
counts for r < 24.5. Although morphological identifi- 
cation of galaxies with FWHM as small as 0.2 arcsec 
is better than possible for the Sloan Digital Sky Survey 
(median seeing ~ 1.3 arcsec), future surveys with higher 
median image quality (for example, 0.7 arcsec predicted 
for LSST) may approach this limit. 

Utilizing the fundamental differences between SEDs of 
stars and galaxies can mitigate the contamination of un- 
resolved galaxies in point source catalogs. In general, 
stellar SEDs are more sharply peaked (close to black- 
body) than galaxies, which exhibit fluxes more broadly 
distributed across wavelength. Traditionally, color-color 
cuts have been used to eliminate galax i es from point 
source catalogs (e.g., Gould et al. 1992 Reitzel et al. 
1998} |Daddi et al.||2004p . Advantages ol the color-color 
approach include its simple implementation and its flex- 
ibility to be tailored to the goals of individual studies. 
Disadvantages of this approach can include its simplistic 
treatment of measurement uncertainties and its limited 
use of information about both populations expected de- 
mographics. 

Probabilistic algorithms offer a more general and in- 
formative approach to photometric classification. The 
goal of probabilistic photometric classification of an as- 
tronomical source is to use its observed fluxes F to com- 
pute the probability that the object is of a given type. 
For example, a star (S) galaxy (G) classification algo- 
rithm produces the posterior probabilities p(S\F) and 
p(G\F) and decides classification by comparing the ratio 
of the probabilities 



O = 



P(S\F) 
P(G\F) 



(1) 



A natural classification threshold is an odds ratio, f2, of 
1, which may be modified to obtain more pure or more 
complete samples. 

Algorithmically there are a large number of approaches 
which produce probabilistic classifications. Generally, 
these fall into i) physically based methods — those which 
have theoretical or empirical models for what type of 
physical object a source is, or ii) data driven methods — 
those which use real data with known classifications 
to construct a model for new data. Physically based 
Bayesian and x 2 template fitting methods have been ex- 
tensively used to infer the properties of galaxies (e.g., 
Coil et al.||2004l |Ilbert et al.||2009l IXia et al] [20091 



Walcher et al. 11201 1| IHildebrandt et al. 120101). However 



in those studies relatively little attention has been paid 
to stars which contribut e marginally to overall source 
counts (although see Rob in et al.||2007| ). Several groups 
have recently investigated data driven, support vector 
machine based star-galaxy separation algorithms (e.g., 
Sagli a et al.||2012] Solarz et al. 2012} Tsalman tza et al. 
2012). 

Tn this paper, we describe, test, and compare two 
physically based template fitting approaches to star- 



galaxy separation (maximum-likelihood and hierarchical 
bayesian), and one data driven (support vector machine) 
approach. In Section 2, we present the conceptual basis 
for each of the three methods. In Section [3j we describe 
the COSMOS data set with which we test the algorithms. 
In Section |4j we discuss the specific details, choices, and 
assumptions made for each of our classification methods. 
Finally, in Section [5] we show the performance of the 
algorithms, and discuss the advantages and limitations 
related to their use as classifiers. 

2. PROBABILISTIC PHOTOMETRIC CLASSIFICATION 
TECHNIQUES 

2.1. Template Fitting: Maximum Likelihood (ML) 

One common method for inferring a source's proper- 
ties from observed fluxes is template fitting. This method 
requires a set of spectral templates (empirical or theoret- 
ical) that span the possible spectral energy distributions 
(SEDs) of observed sources. These template SEDs must 
each cover the full wavelength range spanned by the pho- 
tometric filters used to measure the fluxes to be fit. The 
relative template flux in each filter (for example ugriz) 
for each SED is computed by convolving each SED with 
each filter response curve. Once these relative flux val- 
ues are computed for each SED template, the template 
model is fully specified except for a normalization con- 
stant C. For a given observed source z, the value of C{ is 
proportional to the total luminosity of the source divided 
by the luminosity distance squared. This value of C{ is 
unknown but can be 'fit' to the data. 

The maximum likelihood (ML) value of C\ for each 
template that best fits a source's observed fluxes, F, is 
that which returns the lowest x 2 - After assessing the 
ML values of C{ for all the templates, classification is 
straightforward — one need only to compare the lowest 
star x 2 to the lowest galaxy x 2 - I n other words, X%_~ 
Xq = hi(Q) is the classification criteria (see Equation [I]) . 

2.2. Template Fitting: Hierarchical Bayesian (HB) 

Hierarchical Bayesian (HB) algorithms provide another 
template fitting-based approach to photometric classifi- 
cation. Unlike ML approaches, Bayesian approaches offer 
the opportunity to utilize information about how likely 
a source is to be each kind of star or galaxy; the differ- 
ent templates are not treated as equal a priori. With a 
hierarchical Bayesian algorithm, individual source prior 
probabilities do not need to be set in advance of the 
full-sample classification process; the entire sample of 
sources can inform the prior probabilities for each in- 
dividual source. 

Consider the scenario where a G model fits data Fi 
only slightly better than the best S model, while all other 
G models give poor fits and all other S models give nearly 
as likely fits. In this case, ignoring all other S models 
besides the best is the wrong thing to do, since the data 
are stating that S models are generally more favored. 
Capturing this kind of information is one primary aim of 
most Bayesian algorithms. 

To capture this information, we marginalize over all 
possible star and galaxy templates to compute the total 
probability that a source belongs to a certain classifi- 
cation (S or G). For a template fitting-based Bayesian 
algorithm, this marginalization consists of summing up 
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the likelihood of each S template given Fi, as well as 
the likelihood of each G template (across redshift). Note 
that the likelihood of each template is itself calculated as 
a marginalized likelihood. For each template fit, we com- 
pute the total likelihood of the fit by marginalizing over 
the uncertainty in fitting coefficient Q. This marginal- 
ization is the total probability of a Gaussian distribution 
with variance o 2 c . — a value which is returned using least 
squares fitting techniques (e.g., |Hogg et al.||"2010a| ). 

By Bayes' theorem, marginahzation requires we spec- 
ify the prior probability that any object might have a 
given SED template (at a given redshift). The prior 
probability distributions might be chosen to be uninfor- 
mative (for example, flat), informed by knowledge from 
outside studies, or informed by the data on all the other 
objects. The latter approach, referred to as a hierar- 
chical m^dej JL js_w2deJyused in statistical data analysis 
(e.g., Gelman ej^al. 2003) and is beginning to be used 



in astronomy (Mandel et al. 2009; Hogg et al. 2010b 
|Mandel et al.||20111 |shu et al.||2012| ). The benefits o: 
hierarchical approaches are many — b ecause every infer- 
ence is informed by every datum in the data set, they 
generally show improved probabilistic performance over 
simpler approaches, while requiring no additional knowl- 
edge outside the observed data and the template SEDs. 
Functionally, hierarchical approaches consist of parame- 
terizing the prior probability distributions (for example, 
with the mean and variance of a Normal distribution), 
and varying these parameters (known as "hyperparame- 
ters") to determine the probability of all the data under 
all the models. 

For our work, we optimize the hyperparameters of the 
SED template prior distributions to return the maximum 
marginalized likelihood of all the data. This procedure 
will enable us to simultaneously infer the star-galaxy 
probability of each source while determining the hyper- 
parameters that maximize the likelihood of the observed 
dataset. A brief description of the func tional form of 
these priors is given below in Section [42) Although we 
focus on the star-galaxy probabilities in this paper, the 
optimized hyperparameters themselves yield a measure- 
ment of the detailed demographics of a dataset. 

2.3. Support Vector Machine (SVM) 

A support vector machine (SVM) is a type of machine 
learning algorithm particularly well suited to the prob- 
lem of classification. SVM algorithms are frequently used 
in non- astronomical problems, and are considered a gold 
standard against which to compare any new classifica- 
tion method. SVM algorithms are "supervised", mean- 
ing they train on a catalog of objects with known clas- 
sifications to learn the high dimensional boundary that 
best separates two or more classes of objects. For classi- 
fication problems which do not separate perfectly, SVMs 
account for misclassification errors by looking at the de- 
gree of misclassification, weighted by a user specified er- 
ror penalty parameter. In general the optimal boundary 
need not be restricted to a linear hyperplane, but is al- 
lowed to be non-linear and so can require a very large 
number of parameters to specify the boundary. In order 
for non-linear SVM classification to be computationally 
feasible, a kernel function is used t o map the problem to 
a lower dimensional feature space ( |Boser et al.|| 1992). 



For the case of star-galaxy separation based on broad 
band photometry, the SVM algorithm learns the bound- 
ary which best separates the observed colors and appar- 
ent magnitude^] of stars and gala xies. For mor e details 
on the SVM technique, please see Miil ler et al] ( pOOl) . 

Successful implementation of a SVM algorithm re- 
quires a training dataset that is a sufficient analog to the 
dataset to be classified. A SVM has recently been applied 
to source cla ssification in the Pa n- STARRS 1 photomet- 
ric pipeline (Saglia et al. 2012[ ), with promising initial 
results. However, these results were obtained based on 
analysis of bright, high signal-to- noise data (r < 18), 
using training data which is a subset of the data itself. 

To investigate the impact of training set quality and 
demographics on the problem of star-galaxy separation, 
we will consider the utility and performance of SVM al- 
gorithms in a new classification regime, where the data is 
of lower signal to noise (described in Section [3]), and the 
number of unresolved galaxies is comparable to or larger 
than the number of stars. 

3. TEST DATA 

To investigate the advantages and disadvantages of 
star-galaxy classification techniques, we need a test cat- 
alog which has a large number of sources, is well un- 
derstood and calibrated, and for which spectroscopy 
or mult i- wavelength observations reveal the true source 
classifications. In addition, we want these data to be 
magnitude limited as faint as r > 24 in order to under- 
stand the problem of classification in current and upcom- 
ing surveys like Pan-STARRS 1, DES, and LSST. The 
COSMOS catalog satisfie s these requirements. 

The COSMOS survey flScoville et aT1|2007bD covers ~ 
2 square degrees on the sky using 30 band photometry, 
and is magnitude limited down to r ~ 25. Broadband 
ugrizJK photometry exists down to limiting magnitudes 
which complement the r limiting magnitude, and Spitzer 
IRAC coverage exist for sources as faint as K < 24 (|Ca- 
pak et al.ir2007b| |Sanders et"aT1|2007| |Taniguchi etaT 
2007). In addition, GALEX and XMM coverage are of 
sufficient depth to pick out relatively bright star-forming 
galax ies and AGN ( jHasinger et al.|2007| |Zamojski et al. 
2007). The spectral coverage beyond the optical, partic- 
ularly the near-infrared, can be a powerful discriminator 
between star and galaxy classification. For instance, [IF] 
bert et ah] ( |2009[ ) show the r — m^.Q^m vs. r — i colors 
cleanly separate star and galaxy loci, since stars have 
systematically lower r — ms6^m colors. In addition to 
30 band photometry, the COSMOS field has EST /AGS 
i— band coverage, down to a limiting magnitud e of i ~ 28 
( Koekemoer et ah]|2007 Scoville et al. 2007a| ). Diffrac- 
tion limited HST imaging allows the morphological dis- 
crimination of point-like and extended sources, further 
strengthening the fidelity of the COSMOS star-galaxy 
classification. 

We follow the COSMOS team's star-galaxy classifica- 
tion criteria in order to determine the 'true' classification 
for the purpose of testing our methods. These consist of 
a x 2 classification from fitting star and galaxy templates 
to the 30 band photometry, and a morphological classi- 
fication using the ACS_MU_CLASS st atistic derived by th e 
analysis of the HST photometry by Scarlata et al. (2007). 



We use apparent r magnitude here. 
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Fig. 2. — The color-color space distribution of point sources 
(FWHM < 0.2 arcsec) in the COSMOS catalog. It is clear that 
stars in the sample follow a tight locus in all slices of color-color 
space, while galaxies are more generally distributed. Even so, com- 
parison by eye reveals significant overlap between stars and galax- 
ies, particularly for bluer sources. 



We label COSMOS sources as stars if ACS_MU_CLASS says 
the source is pointlike, and the 'star' x 2 is lower than 
that for 'AGN/QSO'. For galaxies, we require the source 
to have a non-pointlike ACS_MU_CLASS. This classification 
assumes that all galaxies in the HST images are resolved. 
We view this as an excellent approximation of the truth 
- COSMOS ACS images are very deep (i ~ 28), and can 
thus detect the faint extended features of nearly unre- 
solved galaxies. We have qualitatively confirmed this by 
examining the distribution of galaxy FWHM, and find 
the distribution to be smoothly decreasing down to the 
smallest FWHM in the data. We estimate the number 
of galaxies labeled as stars to be below the few-percent 
level. For the labeling, we use an updated version of 
the publicly available photometric catalog, provided by 
P. Capak (private communication). While present in 
the catalog, we do not use any photometric redshift in- 
formation in determining the classification of COSMOS 
sources. 

Throughout this paper, we restrict our analysis to 
sources likely to be unresolved in ground based data 
< 0.2 arcsec). We do so since com- 
monly used morphological classification criteria will eas- 
ily distinguish quite extended sources, accounting for a 
majority of galaxies to depths of r ~ 24 — 25. However, 
galaxies with angular sizes < 0.2 arcsec are unlikely to be 
resolved in surveys with seeing > 0.7 arcsec, and so are 
an appropriate test bed for the type of sources which will 
rely the most on photometric star-galaxy separation. In 
total, our sample consists of 7139 stars and 9167 galaxies 
with apparent magnitudes 22.5 < r < 25, and is plotted 
in ugriz color-color space in Figure [2] Over this mag- 
nitude range, the median signal-to- noise in the r band 
ranges from ~ 50 at r = 22.5 to ~ 15 at r = 25, with 
lower corresponding ranges of 10 to 7 in the u. Of all 
18606 sources with FWHM< 0.2 arcsec in the COSMOS 
catalog, we identified 2300 AGN, which we discard from 
our current analysis. 



In this Section, we describe our implementation of ML 
template fitting, HB template fitting, and a SVM on the 
ugriz photometry of COSMOS sources for purposes of 
star-galaxy classification. 

4.1. ML Template Fitting 

Template based star-galaxy classification relies on the 
use of spectral energy distribution templates which (as 
well as possible) span the space of colors for both stars 
and galaxies. For our stellar model library, we first adopt 
the Pickles (1998) set of empirically derived SEDs, which 
span O to M type stars for both main sequence, giant, 
and supergiant stars. The vast majority of the SEDs 
in the Pickles library have solar abundances, so we sup- 
plement the li brary with theoretical S EDs from Castelli- 
Kurucz (CK) flCastelli fc Kurucz|2004|) We use CK mod- 
els with abundances ranging from —2.5 < [Fe/H] < 0.0, 
surface gravities ranging from 3.0 < log(g) < 0.0, and 
effective temperatures from 3500 < T e // < 10000 K. 
We include binary star templates by combining like- 
metallicity templates using flux calibrated CK models. 
Finally, we include SDSS M9 through L0 dwarf templates 
provided by J. J. Bochanski (private communication). 
T hese templates have been extended from the templates 
of Bochanski et al. (2007) into the near infrared, but 



lack data for wavelengths shorter than 4000 A. We ex- 
tend these templates down to the 3000 A using a main 
sequence CK model with T e ff = 3500K. Details of this 
extension are likely to be unimportant, since the flux of 
such stars between 3000 — 4000 A is negligible. Our final 
combined library of stellar templates includes 131 from 
the Pickles library, 25 6 from the CK library, 11 from 



Bochanski et al. (2007), and 1319 binary templates con- 
structed from the CK library, for a total of 1717 stellar 
templates. 

We select for our galaxy tem plates those used by the 
COSMOS team, described in (filbert et all 120091), pro- 



vided publiclv t hrough the Le Phare pho tometri c red - 



shift packag^j ( |Arnouts et al. 11999 



These templ ates consist of galaxy SEDs from Polletta 



Ilbert et al. 



2006) 



encompassing 7 elliptical and 12 spiral 
(SO-Sd) SEDs. Additionally, 12 representat ive starburst 
SEDs are included, which were added by lllbert et al. 
2009) to provid e a more exten s ive ra nge of blue colors. 



(2 
T 



templates from Polletta et al.| (|2007j) include effects of 
dust extinction, since they were selected to fit spectral 
sourc es in the VIMOS VLT Deep Survey (|Le Fevre et al.| 



2005). We do not consider any additional dust extinction 
beyond these fiducial templates. In order to model our 
galaxies across cosmic time, we redshift these templates 
on a discrete linear grid of redshift s, ranging from to 
4 in steps of 0.08. Simple tests using the ML procedure 
indicate small changes to the step size of our grid are 
unimportant. 

For all of the above templates, model fluxes were con- 
structed by integrating the SED flux density values with 
the throughput response curves for each filter. These 
consist of a u* response curve for the observations taken 
by the Canada- France- Hawaii Telescope, and g + , r + , z + , 
z + response curves for data collected by the Subaru tele- 
scope. We obtained the same response curves used by 



4. 



IMPLEMENTATION OF THREE STAR-GALAXY 
CLASSIFIERS 



5 http : //www . cf ht . hawaii . edu/°/ 7Earnouts/LEPHARE/lephare . html 
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Ilbert et ah] ( |2QQ9| ) through |Le P harep) To check for 
any mismatch between the data, calibrations, and/or re- 
sponse curves, we verified that model colors generated 
from the SEDs overlap well with the star and galaxy 
loci. 

4.2. HB Template Fitting 

While the HB template fitting tech nique builds on the 
foundation described in Section |4~T| the details of star- 
galaxy inference require significantly more mathematical 
formalism to thoroughly describe. We present the de- 
tails of this formalism and a detailed, step-by-step de- 
scription of our HB inferential procedure in Appendix 
A. Open-source C code is available at http : //githu b.com/ 

|rossf adely/star-galaxy-classif ication[ In this section, we 

qualitatively describe features specific to our HB algo- 
rithm. We emphasize that hierarchical Bayesian algo- 
rithms are unsupervised: we use no training set and do 
not set priors in ad vanc e of running the algorithms. As 
described in Section f2?2\ the priors for the templates are 
inferred from the data itself. 

Our HB template fitting method draws from the s ame 
set of SED templates described above in Section |4.1| 
However, to speed up the algorithm, we used only 250 
of the 1313 star templates, spanning a range of physical 
and color-color properties. In practice, we find the indi- 
vidual choice of these templates to be unimportant (since 
many are very similar) so long as the templates span the 
colors of stars, with a sampling close to or better than 
the typical color uncertainties of the data. We believe 
similar arguments to be true for galaxies, but have not 
explored such issues since we currently use only 31 galaxy 
templates. 

The primary choice we must make for our HB approach 
is the functional form(s) of the prior probability distri- 
butions in the model. Since our templates are discrete 
both in SED shape and physical properties, we parame- 
terize the prior probability of each template to be a sin- 
gle valued weight, within the range to 1, such that the 
weights sum to 1 (see, for example, A9 and A10). These 
weights themselves become hyperparameters m our op- 
timization. We thus have 281 hyperparameters corre- 
sponding to template priors since we use 250 star and 
31 galaxy templates. The overall prior probability that 
any given object is S or G is al so p arame terized as two 
weights that sum to one (A13 and A14 in Appendix), 
which we optimize. 

For the galaxy models, we must choose a form for our 
redshift priors. Ideally, these should be parameterized 
as weights for each discrete redshift, repeated as a sep- 
arate set for each galaxy template. Unfortunately, this 
would not only add 51 x 31 more hyperparameters to op- 
timize, but also significantly slows down likelihood com- 
putations. Instead, we adopt a flat prior distribution 
across redshifts. While not ideal, such a prior eases com- 
parison with ML classification results, and eliminates the 
need to specify an informative prior which correctly de- 
scribes the data. Tests of flat versus fixed-form prior 
distributions indicate that the classification results pre- 
sented in Section [5] do not vary substantially between the 
two choices. 

Finally, for each template fit we marginalize over the 
(Gaussian) uncertainty in the fit a mplit ude, for which 
we must specify a prior distribution (A6 and A7 in Ap- 



pendix). We adopt a log- normal prior for the fit am- 
plitudes, which we set by taking the mean and variance 
of the log-amplitudes from fits of all the data for a given 
template. This approach makes the priors essentially un- 
informative, since the variance for all the data is large 
with respect to the variance for data which is well fit 
by the template. Like redshift priors, these too could be 
treated as hyperparameters but come at the cost of much 
slower likelihood computations. 

In summary, we fix redshift and fit- amplitude pri- 
ors and vary the prior weights of the template and 
(5, G) probabilities. Thus, we optimize 283 prior (hy- 
per) parameters to values which yield the maximum like- 
lihood of the entire dataset. 

4.3. SVM Models 

We use the LIBSVM |^] — set of routines, described in 
Chang fc Lin| pOTTj). The provided routines are quick 
and easy to implement, and only require the user to spec- 
ify a training set of data, a set of data to be classified 
(a.k.a., test data), and the form and parameter values of 
the kernel function used. 

We employ a Gaussian radial basis function for the 
SVM kernel, for which we must specify a scaling factor 
7. Together with the error penalty parameter (Csvm) 
we have two nuisance parameters whose optimal values 
we need find. We do this by using a Nelder-Mead sim- 
plex optimization algorithm to find the parameter values 
which provide the highest number of correct classifica- 
tions in the test data. In detail, the optimal values for 
7, Csvm will be different for each combination of training 
and test data. 

To select the training data, we consider two scenarios. 
First is a 'best case' situation (SVM5 es£ ), where a well- 
characterized training set exists which is a fair sampling 
of the test data, with both the same object demographics 
and same signal-to-noise (S/N) as the data to be classi- 
fied. To emulate this scenario, we select the training set 
as a random sample of the COSMOS catalog. Second, we 
consider a more realistic case where the available training 
set is only sampling the demographics of the high S/N 
portion of the catalog to be classified (SVM rea j). In this 
case, the demographics of objects in the training set may 
not match the demographics of the majority of objects 
in the set to be classified. 

We consider SVM5 es ^ an optimistic scenario — 
obtaining a large spectroscopic or multi- wavelength sam- 
ple of training data, down to the limiting magnitude of 
a given survey, is very costly in terms of telescope time. 
The other extreme, SVM rea /, is a bit more realistic — for a 
given survey, classifications are typically easily obtained 
only at the high S/N end of the data. In both cases, 
we consider a training sample size which is a fifth of the 
total catalog size. 

Finally, to implement the SVM classification routine 
we need to scale both the training data and test data. 
That is, for the colors and apparent magnitude used, 
we must scale the range of each to lie between —1 and 
1. We map both training and test data to the interval 
[—1,1] using the full range of values in the test data. This 
is important in the case of SVM rea /, since the training 
data may not span the full range of values for the test 
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Fig. 3. — The completeness as a function of magnitude produced 
by the indicated classification approaches. Results for stars are on 
the left in green, while those for galaxies are shown on the right in 
blue. The thin, solid black line indicates the sample fraction for a 
given object type. For galaxies, the various methods return similar 
completeness values, while the discrepancy is much larger in the 
case of stars. 

data. We find that scaling can have a significant effect 
for the SVM rea | model. For example, poor classification 
performance is obtained if the SVM rea j training data is 
scaled to itself rather than to the test data. 

5. RESULTS AND DISCUSSION 

We report the classification performance of Maximum 
Likelihood (ML) and Hierarchical Bayesian (HB) tem- 
plate fitting, as well as a thoroughly tested Support Vec- 
tor Machine (SVM) on our COSMOS based test data. 
There are many different measures which can be used 
to assess the performance of each algorithm. First, we 
consider the completeness^ and purit}jjof classified sam- 
ples, evaluated at ln(fi Jj = 0. Figures [5] and [ZJ display 
the completeness and purity, respectively, as a function 
of magnitude. Examining Figure [3j all methods seem to 
be fairly competitive for galaxy classification, returning 
80 — 90% completeness across all magnitudes. SVMfc es t 
and ML yield the most consistently robust completeness 
for galaxies. In the case of stars, however, it is clear only 
our HB template fitting and SVM5 eS £ deliver acceptable 
completeness — at r > 24 the completeness of ML tem- 
plate fitting falls to 50% or below, and the complete- 
ness for SVM rea / goes to zero. The mismatch in source 
demographics between the realistic training set and the 
faint COSMOS sources severely undermines the efficacy 
of SVM reaZ . 

In terms of purity (Figure |4|, SVM5 es ^ outperforms 
all other approaches. For stars, HB yields similar per- 
formance to SVM5 eS £, but all approaches underperform 
SVM5 es£ in terms of galaxy purity. When taken in con- 
cert with the results of Figure [3J we see that HB delivers 
similar or better performance than ML in all cases, even 
with the relatively simple HB approach presented here. 
For stars, ML and HB yield similar sample purity, but 

7 Defined as the fraction of sources of true type X, correctly 
classified as X. 

8 Denned as the number of sources of true type X, correctly 
classified as X, divided by the total number of sources classified as 
X. 

9 Defined in Equation 1 
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Fig. 4. — Similar to Figure [3] but showing purity of classified 
samples, instead of completeness. Results for stars are on the left 
in green, while those for galaxies are shown on the right in blue. 
Here, SVM algorithms generally outperform all others, if given a 
very good set of training data (SVMb est ). For stars, our HB algo- 
rithm delivers somewhat similar purity to the SVMf, esi scenario. 
For galaxies, however, HB underperforms SVM^est as the stellar 
fraction of the sample decreases. 

HB does so with a much higher completeness (~ 80% vs. 
~ 50%). For galaxies, HB yields a consistently higher 
sample purity by ~ 10 — 15% but a consistently lower 
sample completeness by ~ 10%. 

We infer below that the performance achieved by the 
SVM5 est algorithm may represent the best possible clas- 
sification of stars and galaxies that could be done, based 
on single-epoch ugriz photometry alone. However, it is 
unlikely that an ideal training set will be available for 
object classification in future, deep datasets. Identifying 
the regions of ugirz color-color space where classification 
fails can highlight possible ways to improve the unsuper- 
vised HB (or ML) classification methods implemented 
here. For example, we want to check for regions of color- 
color space in which templates used in ML and HB may 
be missing, or to check whether the implementation of 
simple, but stronger, priors could increase performance. 

Figures [5] and [6] show the fraction of sources correctly 
classified using HB and SVM^ggt, distributed over colors. 
Comparing with Figure [2] reveals that the places where 
classification is least successful are regions where stars 
and galaxies overlap the most in color. For example, both 
the SVMfrest and the HB algorithm struggle to correctly 
identify galaxies with 1 < u — g < 3 and 1 < g — r <1.5. 
The number density of galaxies in the failing region is 
low, making HB even more likely to call everything a star. 
Similarly, both stars and galaxies populate u — g<l and 
g — r ~ 1, presenting a challenge to both SVM and HB 
algorithms. In this case, the number density of galaxies 
is higher than that of stars, making HB even more likely 
to call everything a galaxy and training SVM on a color 
separation that favors galaxies over stars. 

In the region of r — i > 1.5, the stellar locus has es- 
sentially zero overlap with galaxies in the sample. The 
SVMfrest algorithm yields exquisite classification of these 
stars, while the HB algorithm returns only a mediocre 
performance (although g — r < 1 and r — i > 1.5 is popu- 
lated with few stars, so those poorly classified regions do 
not represent a significant fraction of all stars). In future 
work, the classification of r — i > 1.5 stars could therefore 
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Fig. 5. — The fraction of objects correctly classified at ln(f2) = 
using our HB template fitting, distributed in ugriz color-color 
space. The top panel shows the performance on stars, and the 
bottonipanel shows the performance on galaxies. Comparing with 
Figure p] it is clear that classification is most successful for regions 
in whicn stars and galaxies do not overlap in color-color space. 

be improved with the implementation of stronger priors 
on the permitted redshifts at which galaxies may live — 
for example, by forcing a zero probability of elliptical 
galaxies at high redshifts. 

Locating regions of color space in which the classi- 
fiers struggle to correctly separate stars and galaxies not 
only helps to decipher weaknesses in classification al- 
gorithms, but can be used to identify the specific sci- 
ence cases which will be most highly impacted. To il- 
lustrate, we examine places where both SVM5 es ^ and 
HB underperform and compare these regions to the ob- 
ject types in our templates. For stars we identify two 



such regions. The first lies within 0.0 < 



< 1.0 



and 0.7 < g — r < 1.5, which has been suggested to 
be co mpris ed of white d warf, M dwarf binaries ( Sil vestri| 
et al ||2006| |Covey et aL|2007| ). The second region, with 
0.0 < u — g ^ 1.0 and 0.0 ^ g — r < 0.5, is consistent 
with metal-poor main- sequence turnoff stars. The rela- 
tively poorer performance in this region is particularly 
troubling, since these populations are some of the main 
tracers for low-surface brightness Galactic halo structure. 

For galaxies, association of underperforming regions to 
specific populations is less clear-cut. For instance, we 
find the poor performing region with 1.5 < u — g < 3.0 
consistent with S0/SA SEDs with redshifts less than 0.4, 
but also with dusty starbursting galaxies across a wider 
redshift range. While far from comprehensive, these as- 
sociations highlight the fact that classification perfor- 
mance can affect certain science cases more than oth- 
ers, and should be accounted for both during individual 
analyses and in future algorithm development. 

One of the great advantages of probabilistic classifica- 
tion is that one need not restrict the classification crite- 
rion to a fixed value. By moving away from ln(fi) = 0, 
one can obtain more/less pure or complete samples of 
stars and galaxies, depending on the user's science case. 
In detail, how the purity or completeness varies as a func- 
tion of ln(fi) depends on the algorithm used. To illus- 
trate, we show in Figure [7] how purity and completeness 
vary for the log odds ratio output by our HB algorithm. 
In the figure, as \n(Q) decreases, we are requiring that 
the relative likelihood that an object is a galaxy is much 
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Fig. 6. — The same as Figure[5] but for a SVM trained with data 
which span the S/N range of the whole sample (SVMb es t). The 
top panel shows the performance on stars, and the bottom panel 
shows the performance on galaxies. By inspection, it is clear that 
SVMb est outperforms HB template fitting, particularly in the case 
of galaxies. A striking difference is the poor galaxy classification 
of HB compared to SVM5 es t in u — g. This may indicate a model 
deficiency in the u spectral range of our galaxy templates. 




-2 2 

ln(Odds Ratio) or ln(fi) 

Fig. 7. — Hierarchical Bayesian template fitting results showing 
completeness (solid) and purity (dashed) lines as a function of lnf2. 
Results for stars are shown in green and galaxies are shown in blue, 
while the solid (dashed) curves show completeness (purity). Also 
indicated by green and blue horizontal lines is the relative fraction 
of stars and galaxies in the sample, respectively. The top panel 
shows the histograms associated with the distribution. Setting 
lnf2 >= 6 effectively calls all sources galaxies, so the galaxy com- 
pleteness is high, while the purity is set by the sample fraction of 
galaxies. The same conclusions are reached for stars at lnf2 < —6. 
The exact value of In fl chosen depends on the completeness and 
purity requirements dictated by the user's science case. 

higher than that for a star. Similarly, as ln(^) increases 
we are requiring objects be more stringently classified 
as a star. Thus, by moving away from ln(fi) = we 
change the star/galaxy purity and completeness to the 
point where everything is called a star or galaxy, giving 
100% complete samples with a purity set by the sam- 
ple fraction. One caveat, however, is that modifying the 
threshold ft to achieve more pure samples may select ob- 
jects which lie in particular regions in SED space. To 
illustrate, we show in Figure [8] the distribution of ln(^) 
in color space. 

We have considered the completeness and purity of sets 
of data classified as stars or galaxies (as a function of 
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Fig. 8. — The median ln(Q) of objects produced by our HB tem- 
plate fitting, distributed in ugriz color-color space. Similar to Fig- 
ure [5] regions with the most extreme ln(f2) values are primarily 
those which have little color-color overlap between stars and galax- 
ies. While altering the ln(f2) threshold can deliver more pure or 
complete samples (cf. Figure it may likely bias the sample to 
certain regions of color space. 

ln(fi)) as one means of comparing different classification 
algorithms. A strength of this approach to quantifying 
the efficacy of classification algorithms is its transparent 
connection to different science requirements, in terms of 
purity and completeness. A weakness of this approach is 
the impossibility of selecting an overall "best" algorithm 
that presents an average over competing scientific re- 
quirements. For example, Figure [3] shows that compared 
to SVM5 es t, our HB method gives better completeness in 
stars but slightly worse completeness for galaxies — which 
performs better in general? 

We assess the overall performance of the various clas- 
sification algorithms using the Receiver Operating Char- 
acteristic (ROC) curve. A ROC curve is a plot of the 
true positive rate versus the false positive rate of a bi- 
nary classifier, as the classification threshold (ln(^)) is 
varied. In Figure [9| we plot the ROC curve for all four 
classification approaches considered here. An ideal clas- 
sifier has a true positive rate equal to one for all values of 
ln(^). Thus, the Area Under the Curve (AUC) statistic 
is an assessment of the overall performance of the clas- 
sifier. There are several points worth noting in Figure 
[9] First, we find our HB approach to template fitting 
outperforms the ML approach. Considering our simple 
HB implementation is not very computationally demand- 
ing (tens of minutes on typical desktop computer), even 
a basic HB approach should always be preferred over 
the ML case. SVM algorithms, when trained with data 
which accurately capture the SED and S/N properties 
of the entire data, generally perform much better than 
our current template fitting methods. This is not sur- 
prising, since template driven algorithms are never likely 
to have as complete models as something data driven. 
In reality, available training data will likely only capture 
the high S/N end of the survey in question. As shown 
in Figure M a SVM rea ^ scenario underperforms even ML 
template ntting, casting severe doubt onto the useful- 
ness of SVM with ill-suited training information. Future 
surveys which intend to use supervised techniques, there- 
fore, will have to carefully consi der if alternate strategies 
for obtaining training data (e.g., Richards et al.|2Q12a|b[ ) 




0.4 0.6 

False Positive Rate 

Fig. 9. — The Receiver Operating Characteristic (ROC) curve 
for four photometric classification approaches: SVMfc est , SVM rea j, 
ML, and HB. The ROC curve shows the true positive rate versus 
the false positive rate, as ln(Q) varies. An ideal classifier always 
returns a true positive rate of one, so the Area Under the Curve 
(AUC) provides a general assessment of the performance. 

can outperform template fitting methods. 

6. CONCLUSIONS 

Imminent and upcoming ground-based surveys are ob- 
serving large portions of the sky in optical filters to 
depths (r > 24), requiring significant amounts of money, 
resources, and person power. In order for such sur- 
veys to best achieve some of their science goals, accu- 
rate star-galaxy classification is required. At these new 
depths, unresolved galaxy counts increasingly dominate 
the number of point sources classified through morpho- 
logical means. To investigate the usefulness of photo- 
metric classification methods for unresolved sources, we 
examine the performance of photometric classifiers us- 
ing ugriz photometry of COSMOS sources with intrinsic 
FWHM < 0.2 arcsec, as measured with HST. We have 
focused our analysis on the classification of full survey 
datasets with broad science goals, rather than on the 
classification of subsets of sources tailored to specific sci- 
entific investigations. 

Our conclusions are as follows: 

• Maximum Likelihood (ML) template fitting meth- 
ods are simple, and return informative classifica- 
tions. At hi(Q) = 0, ML methods deliver high 
galaxy completeness (> 90%) but low stellar com- 
pleteness (~ 50%). The purity of these samples 
range from ~ 50 — 95%, and are a strong function 
of the relative sample fraction. 

• We present a new, basic Hierarchical Bayesian 
(HB) approach to template fitting which outper- 
forms ML techniques, as shown by the Receiver Op- 
erating Characteristic (ROC). HB algorithms have 
no need for training, and have nuisance parame- 
ters that are tuned according to the likelihood of 
the data itself. Further improvements to this basic 
algorithm are possible by hierarchically modeling 
the redshift distribution of galaxies, the SEDs of 
the input templates, and the distribution of appar- 
ent magnitudes. 

• Support Vector Machine (SVM) algorithms can 
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deliver excellent classification, which outperforms 
template fitting methods. Successful SVM perfor- 
mance relies on having an adequate set of training 
data. For optimistic cases, where the training data 
is essentially a random sample of the data (with 
known classifications), SVM will outperform tem- 
plate fitting. In a more-realistic scenario, where 
the training data samples only the higher signal to 
noise sources in the data to be classified, SVM al- 
gorithms perform worse than the simplest template 
fitting methods. 

• It is unclear when, if ever, adequate training data 
will be available for SVM-like classification, HB al- 
gorithms are likely the optimum choice for next- 
generation classifiers. 

• A downside of a paucity of sufficient training data 
is the inability to assess the performance of both su- 
pervised (SVM) and unsupervised (ML, HB) clas- 
sifiers. If knowing the completeness and purity 
in detail is critical to the survey science goals, 
it may be necessary to seek out expensive train- 
ing/testing sets. Otherwise, users will have to se- 
lect the best unsupervised classifier (HB here), and 
rely on performance assessments extrapolated from 
other studies. 

• Ground based surveys should deliver probabilistic 
photometric classifications as a basic data prod- 
uct. ML likelihoods are useful and require very 
little computational overhead, and should be con- 
sidered the minimal delivered quantities. Basic or 
refined HB classifications require more overhead, 
but can be run on small subsets of data to learn 
the priors and then run quickly on the remaining 
data, making them a feasible option for large sur- 
veys. Finally, if excellent training data is available, 
SVM likelihoods should either be computed or the 
data should be made available. In any scenario, 
we strongly recommend that likelihood values, not 
binary classifications, should be delivered so that 
they may be propagated into individual analyses. 

The future of astronomical studies of unresolved 
sources in ground based surveys is bright. Surveys like 
PanSTARRS, DES, and LSST will deliver data that, in 
conjunction with approaches discussed here, will expand 
our knowledge of stellar systems, the structure of the 
Milky Way, and the demographics of distant galaxies. 
We have identified troublesome spots for classification 
in single-epoch ugriz photometric data, which may hin- 
der studies of M-giant and metal-poor main-sequence 
turnoff stars in the Milky Way's halo. Future studies 
could improve upon our preliminary results by implenet- 
ing more-sophisticated prior distributions, by identifying 
crucial improvements needed in current template models 
or training data, or by pursuing complementary non-SED 
based classification metrics. 
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APPENDIX 

HIERARCHICAL BAYESIAN STAR GALAXY CLASSIFICATION 

In this appendix, we provide a detailed, step-by-step description of our Hierarchical Bayesian algorithm. First, let 
us define the data as the sets: 

F = {10-§ mi Fi, , ...,10-§ m 'i^ , ...,10"§ miV F i v,o} 

<T J r = {?ln(10)Ficr mi , ... , - ln(10)F z <r mn ... , % ln(10)Fjv<r } , (Al) 
5 5 ' 5 

where cr rni is the observed magnitude and uncertainty in filter number / for N number of filters. One sequence 
for the filters / corresponds to {/} = {i£,g,r, i,z}. The zeropoint, F^o, is: 



*i,o = J A S\ R\jd\ , (A2) 

where S\ is the standard flux density spectrum (Vega or AB), and is the fraction of photons incident on the top 
of the atmosphere which are counted by the detector, as a function of wavelength. 
Next, we generate a model for the data using the templates: 



Fmod,l= J A /a, mod R\,l&^ , (A3) 

where /a, mod corresponds to the flux density of a given spectral template. Finally, we define a goodness of fit statistic: 



X 



2 _ ^ (jj ~ ^mod ^mod,l) 



where C mo d is a constant unitless amplitude applied to the model for the fit (discussed more below as CV/). The 
variance of otal j = g 2 Fi + r] 2 Ff, where r] is a few percent and represents a nuisance parameter which (in a global sense) 
accounts for error in the models as well as underestimates in o\ . The value of x 2 from our template fitting is the 
fundamental quantity on which our inference procedure is based, as follows below. 
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We represent the hypothesis that an object i is a star or a galaxy by "6"' or "G" respectively. For a given object z, 
we fit a set of templates j corresponding to S using the procedure outlined above. The likelihood of template j and 
amplitude C^ under the stellar hypothesis S given the single observed data point Fi is: 

p(F l \C lJ ,j,S)cxeM-lx 2 ) , (A5) 



where Fi is the full set of observations of object i and the associated noise model, and x 2 is defined in Equation A£ 
Note that the x 2 is not necessarily the best-fit value for x 2 but rather the x 2 obtained with template j when it is given 
amplitude Cij. 

We could optimize this likelihood, but really we want to compare the whole S model space to the whole G model 
space. We must marginalize this likelihood over the amplitude and template. To demonstrate this, let us step through 
each marginalization for the S model space. 

Marginalization over the amplitude Cij looks like 

p(F z | j, S,a) = J p(Fi\C iS , j, S) p{C %j \j, S, a) dC iS , (A6) 

where the integral is over all permitted values for the amplitude C^, and the prior PDF p(Cij\j, S, a) depends on the 
template j, the full hypothesis S. Note, the prior PDF obeys the normalization constraint 

l = J P (C ij \j,S,a)dC ij . (A7) 

Here we have also introduced some "hyperparameters" a, which are variables which parameterize prior distributions. 
The subset of hyperparameters ol which apply to p{Cij \j, 5, a) might be, for example, the mean and variance of a log- 
normal distribution on Cij. It is the simultaneous inference of the star-galaxy probabilities and the hyperparameters 
that make the approach hierarchical. 

Any realistic prior PDF for the Cij comes from noting that (for stars), the Cij are dimensionless squared distance 
ratios between the observed star and the template star; in this case the prior involves parameters of the stellar 
distribution in the Galaxy. When we look at galaxies (below), this situation will be different. In the (rare) case that 
the prior PDF p(Cij \j, 5, a) varies slowly around the best-fit amplitude, 

p(Fi | j, S, a) ex exp(- ^ x 2 ) p(Cij \j, <*) °aj , (A8) 

where x 2 is the best-fit chi-squared, Cij is the best-fit amplitude, and &cij is the standard uncertainty in Cij found 
by least-square fitting. This approximation is that the prior doesn't vary significantly within a neighborhood acij of 
the best-fit amplitude. 
Marginalization over the template space looks like 

p(F i \S,<x) = Y fP (F i \j,S)P(j\S,<x) , (A9) 

3 

where P(j\S,ot) is the prior probability (a discrete probability, not a PDF) of template j given the hypothesis S and 
the hyperparameters a. It obeys the normalization constraint 

i = J2pU\s,<*) . (A10) 

3 

Note P(j\S, a.) is a discrete set of weights, whose value corresponds to the hyperparameter for template j. 
To summarize, the marginalized likelihood p(Fi\S, a) that a source i is a star S is computed as: 

p(Fi\dj,j, S)ocexp(-^x 2 ) 
p(F i \j,S,a)= j p(F i \C ij ,j,S)p(C ij \j,S,a)dC ij 
p(F i \S,a)=J2p(Fi\j,S,a)P(j\S,a) . (All) 



The marginalized likelihood that source i is a galaxy G, is calculated following a very similar sequence. In calculating 
the likelihood, we allow a given galaxy template k to be shifted in wavelength by a factor 1 + z. This introduces another 
step in the calculation that marginalizes the likelihood across redshift for a template, giving 
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p(Fi\Cikz, k, z, G) ocexp(-- x 2 ) 

p(Fi|A;,2?,G,a) = ^ p(F ^G^.k, z, G)p(Cik z \k, z, G, a) dCi 
p(Fi\G, fc, a) = $>(^l fe > ^ G ) p (*l fc > G > a ) 



5>(F,|fc,G)P(fc|G,a) 



(A12) 



fc 



where now G^ is the constant amplitude for galaxy template k at a redshift z. The marginalization across redshift 
also introduces a prior P(z\k, G, a), which is also is parameterized by a subset of a, under some assumed form for the 
prior. 

This model is fully generative; it specifies for any observed flux F{ the PDF for that observation given either the 
star hypothesis S or the galaxy hypothesis G. We can write down then the full probability for the entire data set of 
all objects v. 



where even the overall prior probability p(S\ct) that an object is a star (or, conversely, a galaxy) depends on the 
hyperparameters a. These obey the normalization constraint 



The likelihood p({Fi} \ct) is the total, marginalized likelihood for the combined data set of all the observations Fi 
for all objects i. From here we can take a number of approaches. One option is to find the hyperparameters that 
maximize this total marginalized likelihood, or we can assign a prior PDF p(ct) on the hyperparameters, and sample 
the posterior PDF in the hyperparameter space. For computational reasons, we choose to optimize p({Fi} \ct) in this 
work. 

With either a maximum-likelihood set of hyperparameters a or else a sampling, inferences can be made. For our 
purposes, the most interesting inference is, for each object z, the posterior probability ratio (or odds) 



p({Fi} \a) = H \p(F l \S, a )p(S\cx) +p(F i \G, a )p(G\cx)} 



(A13) 



l=p(S\a)+p(G\a) 



(A14) 



_ p(S\Fi,a) 
l -p{G\F % ,a) 
p(S\F z ,a)=p(F z \S,a)p(S\a) 
p(G\F i ,a)=p(F i \G,a)p(G\a) 



(A15) 



where we have re-used most of the likelihood machinery generated (above) for the purposes of inferring the hyperpa- 
rameters. That is, the star-galaxy inference and the hyperparameter inferences proceed simultaneously. 



